This report was generated on 2017-10-05 at 23:08:21
This Python 3 environment comes with many helpful analytics libraries installed. It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python (a modified version of this docker image will be made available as part of my project to ensure reproducibility). For example, here are several helpful packages to load:
Import Libraries and Data:
Input data files are available in the "../input/" directory.
Any results I write to the current directory are saved as output.
| | parcelid | airconditioningtypeid | architecturalstyletypeid | basementsqft | bathroomcnt | bedroomcnt | buildingclasstypeid | buildingqualitytypeid | calculatedbathnbr | decktypeid | ... | numberofstories | fireplaceflag | structuretaxvaluedollarcnt | taxvaluedollarcnt | assessmentyear | landtaxvaluedollarcnt | taxamount | taxdelinquencyflag | taxdelinquencyyear | censustractandblock |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10754147 | NaN | NaN | NaN | 0.0 | 0.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 9.0 | 2015.0 | 9.0 | NaN | NaN | NaN | NaN |
| 1 | 10759547 | NaN | NaN | NaN | 0.0 | 0.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 27516.0 | 2015.0 | 27516.0 | NaN | NaN | NaN | NaN |
| 2 | 10843547 | NaN | NaN | NaN | 0.0 | 0.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 650756.0 | 1413387.0 | 2015.0 | 762631.0 | 20800.37 | NaN | NaN | NaN |
| 3 | 10859147 | NaN | NaN | NaN | 0.0 | 0.0 | 3 | 7 | NaN | NaN | ... | 1.0 | NaN | 571346.0 | 1156834.0 | 2015.0 | 585488.0 | 14557.57 | NaN | NaN | NaN |
| 4 | 10879947 | NaN | NaN | NaN | 0.0 | 0.0 | 4 | NaN | NaN | NaN | ... | NaN | NaN | 193796.0 | 433491.0 | 2015.0 | 239695.0 | 5725.17 | NaN | NaN | NaN |
5 rows × 58 columns
| | parcelid | logerror | transactiondate |
|---|---|---|---|
| 0 | 11016594 | 0.0276 | 2016-01-01 |
| 1 | 14366692 | -0.1684 | 2016-01-01 |
| 2 | 12098116 | -0.0040 | 2016-01-01 |
| 3 | 12643413 | 0.0218 | 2016-01-02 |
| 4 | 14432541 | -0.0050 | 2016-01-02 |
| | parcelid | airconditioningtypeid | architecturalstyletypeid | basementsqft | bathroomcnt | bedroomcnt | buildingclasstypeid | buildingqualitytypeid | calculatedbathnbr | decktypeid | ... | structuretaxvaluedollarcnt | taxvaluedollarcnt | assessmentyear | landtaxvaluedollarcnt | taxamount | taxdelinquencyflag | taxdelinquencyyear | censustractandblock | logerror | transactiondate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17073783 | NaN | NaN | NaN | 2.5 | 3.0 | NaN | NaN | 2.5 | NaN | ... | 115087.0 | 191811.0 | 2015.0 | 76724.0 | 2015.06 | NaN | NaN | 61110022003007 | 0.0953 | 2016-01-27 |
| 1 | 17088994 | NaN | NaN | NaN | 1.0 | 2.0 | NaN | NaN | 1.0 | NaN | ... | 143809.0 | 239679.0 | 2015.0 | 95870.0 | 2581.30 | NaN | NaN | 61110015031002 | 0.0198 | 2016-03-30 |
| 2 | 17100444 | NaN | NaN | NaN | 2.0 | 3.0 | NaN | NaN | 2.0 | NaN | ... | 33619.0 | 47853.0 | 2015.0 | 14234.0 | 591.64 | NaN | NaN | 61110007011007 | 0.0060 | 2016-05-27 |
| 3 | 17102429 | NaN | NaN | NaN | 1.5 | 2.0 | NaN | NaN | 1.5 | NaN | ... | 45609.0 | 62914.0 | 2015.0 | 17305.0 | 682.78 | NaN | NaN | 61110008002013 | -0.0566 | 2016-06-07 |
| 4 | 17109604 | NaN | NaN | NaN | 2.5 | 4.0 | NaN | NaN | 2.5 | NaN | ... | 277000.0 | 554000.0 | 2015.0 | 277000.0 | 5886.92 | NaN | NaN | 61110014021007 | 0.0573 | 2016-08-08 |
5 rows × 60 columns
| logerror_bin | count |
|---|---|
| Large Negative Error | 18442 |
| Small Error | 18432 |
| Medium Negative Error | 17973 |
| Large Positive Error | 17947 |
| Medium Positive Error | 17481 |
(90275, 3)
Distribution of Target Variable:
/Users/marskar/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:3: DeprecationWarning: .ix is deprecated. Please use .loc for label based indexing or .iloc for positional indexing See the documentation here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated This is separate from the ipykernel package so we can avoid doing imports until /Users/marskar/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py:179: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy self._setitem_with_indexer(indexer, value)
Log-errors are close to normally distributed around a mean of 0, but with a slight positive skew. There are also a considerable number of outliers; I will explore whether removing these improves model performance.
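One common way to test the effect of outliers is to trim the target at fixed percentiles and compare the result with the untrimmed data. The sketch below is illustrative only: the 1st and 99th percentile cut-offs are an assumption rather than values used in this notebook, and the series is synthetic heavy-tailed data standing in for the real `logerror` column.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for train["logerror"]; the real values come from the competition data.
rng = np.random.default_rng(0)
logerror = pd.Series(rng.standard_t(df=3, size=10_000) * 0.05, name="logerror")

# Hypothetical cut-offs: keep only rows between the 1st and 99th percentiles.
lower, upper = logerror.quantile([0.01, 0.99])
trimmed = logerror[logerror.between(lower, upper)]

print(f"skew before: {logerror.skew():.3f}, after: {trimmed.skew():.3f}")
print(f"rows kept: {len(trimmed)} of {len(logerror)}")
```

Whether trimming helps should be judged on held-out data, since the test set may contain outliers of its own.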
Proportion of Missing Values in Each Column:
/Users/marskar/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2728: DtypeWarning: Columns (22,32,34,49,55) have mixed types. Specify dtype option on import or set low_memory=False. interactivity=interactivity, compiler=compiler, result=result)
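The DtypeWarning above names its own remedy: pass an explicit `dtype` on import (or set `low_memory=False`). Judging by the column-type table in this report, positions 22, 32, 34, 49 and 55 of the properties file appear to correspond to the five `object` columns (`hashottuborspa`, `propertycountylandusecode`, `propertyzoningdesc`, `fireplaceflag`, `taxdelinquencyflag`); that mapping is an inference, not something the notebook states. A minimal sketch on a tiny in-memory CSV with made-up rows:

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the properties CSV; the real file lives under ../input/.
csv_text = (
    "parcelid,hashottuborspa,propertycountylandusecode\n"
    "1,,0100\n"
    "2,True,010D\n"
)

# Assumed dtypes for illustration: read the ambiguous columns as strings (object dtype).
dtypes = {"hashottuborspa": "object", "propertycountylandusecode": "object"}
props = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)

print(props.dtypes)
print(props)
```

Reading the land-use code as a string also preserves leading zeros such as `0100`, which numeric parsing would drop.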
Printed preview of the properties data: the same five rows as the table above, [5 rows x 58 columns]

(2985217, 58)
| occurrences of a parcelid | number of parcelids |
|---|---|
| 1 | 90026 |
| 2 | 123 |
| 3 | 1 |
Exception ignored in: <bound method DMatrix.__del__ of <xgboost.core.DMatrix object at 0x10fa6cc50>>
Traceback (most recent call last):
File "/Users/marskar/anaconda3/lib/python3.6/site-packages/xgboost/core.py", line 324, in __del__
_check_call(_LIB.XGDMatrixFree(self.handle))
AttributeError: 'DMatrix' object has no attribute 'handle'
| | parcelid | logerror | transactiondate | airconditioningtypeid | architecturalstyletypeid | basementsqft | bathroomcnt | bedroomcnt | buildingclasstypeid | buildingqualitytypeid | ... | numberofstories | fireplaceflag | structuretaxvaluedollarcnt | taxvaluedollarcnt | assessmentyear | landtaxvaluedollarcnt | taxamount | taxdelinquencyflag | taxdelinquencyyear | censustractandblock |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11016594 | 0.0276 | 2016-01-01 | 1.0 | NaN | NaN | 2.0 | 3.0 | NaN | 4.0 | ... | NaN | NaN | 122754.0 | 360170.0 | 2015.0 | 237416.0 | 6735.88 | NaN | NaN | 6.037107e+13 |
| 1 | 14366692 | -0.1684 | 2016-01-01 | NaN | NaN | NaN | 3.5 | 4.0 | NaN | NaN | ... | NaN | NaN | 346458.0 | 585529.0 | 2015.0 | 239071.0 | 10153.02 | NaN | NaN | NaN |
| 2 | 12098116 | -0.0040 | 2016-01-01 | 1.0 | NaN | NaN | 3.0 | 2.0 | NaN | 4.0 | ... | NaN | NaN | 61994.0 | 119906.0 | 2015.0 | 57912.0 | 11484.48 | NaN | NaN | 6.037464e+13 |
| 3 | 12643413 | 0.0218 | 2016-01-02 | 1.0 | NaN | NaN | 2.0 | 2.0 | NaN | 4.0 | ... | NaN | NaN | 171518.0 | 244880.0 | 2015.0 | 73362.0 | 3048.74 | NaN | NaN | 6.037296e+13 |
| 4 | 14432541 | -0.0050 | 2016-01-02 | NaN | NaN | NaN | 2.5 | 4.0 | NaN | NaN | ... | 2.0 | NaN | 169574.0 | 434551.0 | 2015.0 | 264977.0 | 5488.96 | NaN | NaN | 6.059042e+13 |
5 rows × 60 columns
| | Count | Column Type |
|---|---|---|
| 0 | parcelid | int64 |
| 1 | logerror | float64 |
| 2 | transactiondate | datetime64[ns] |
| 3 | airconditioningtypeid | float64 |
| 4 | architecturalstyletypeid | float64 |
| 5 | basementsqft | float64 |
| 6 | bathroomcnt | float64 |
| 7 | bedroomcnt | float64 |
| 8 | buildingclasstypeid | float64 |
| 9 | buildingqualitytypeid | float64 |
| 10 | calculatedbathnbr | float64 |
| 11 | decktypeid | float64 |
| 12 | finishedfloor1squarefeet | float64 |
| 13 | calculatedfinishedsquarefeet | float64 |
| 14 | finishedsquarefeet12 | float64 |
| 15 | finishedsquarefeet13 | float64 |
| 16 | finishedsquarefeet15 | float64 |
| 17 | finishedsquarefeet50 | float64 |
| 18 | finishedsquarefeet6 | float64 |
| 19 | fips | float64 |
| 20 | fireplacecnt | float64 |
| 21 | fullbathcnt | float64 |
| 22 | garagecarcnt | float64 |
| 23 | garagetotalsqft | float64 |
| 24 | hashottuborspa | object |
| 25 | heatingorsystemtypeid | float64 |
| 26 | latitude | float64 |
| 27 | longitude | float64 |
| 28 | lotsizesquarefeet | float64 |
| 29 | poolcnt | float64 |
| 30 | poolsizesum | float64 |
| 31 | pooltypeid10 | float64 |
| 32 | pooltypeid2 | float64 |
| 33 | pooltypeid7 | float64 |
| 34 | propertycountylandusecode | object |
| 35 | propertylandusetypeid | float64 |
| 36 | propertyzoningdesc | object |
| 37 | rawcensustractandblock | float64 |
| 38 | regionidcity | float64 |
| 39 | regionidcounty | float64 |
| 40 | regionidneighborhood | float64 |
| 41 | regionidzip | float64 |
| 42 | roomcnt | float64 |
| 43 | storytypeid | float64 |
| 44 | threequarterbathnbr | float64 |
| 45 | typeconstructiontypeid | float64 |
| 46 | unitcnt | float64 |
| 47 | yardbuildingsqft17 | float64 |
| 48 | yardbuildingsqft26 | float64 |
| 49 | yearbuilt | float64 |
| 50 | numberofstories | float64 |
| 51 | fireplaceflag | object |
| 52 | structuretaxvaluedollarcnt | float64 |
| 53 | taxvaluedollarcnt | float64 |
| 54 | assessmentyear | float64 |
| 55 | landtaxvaluedollarcnt | float64 |
| 56 | taxamount | float64 |
| 57 | taxdelinquencyflag | object |
| 58 | taxdelinquencyyear | float64 |
| 59 | censustractandblock | float64 |
| | Column Type | Count |
|---|---|---|
| 0 | int64 | 1 |
| 1 | float64 | 53 |
| 2 | datetime64[ns] | 1 |
| 3 | object | 5 |
| | column_name | missing_count | missing_ratio |
|---|---|---|---|
| 5 | basementsqft | 90232 | 0.999524 |
| 8 | buildingclasstypeid | 90259 | 0.999823 |
| 15 | finishedsquarefeet13 | 90242 | 0.999634 |
| 43 | storytypeid | 90232 | 0.999524 |
/Users/marskar/anaconda3/lib/python3.6/site-packages/numpy/lib/function_base.py:3162: RuntimeWarning: invalid value encountered in true_divide c /= stddev[:, None] /Users/marskar/anaconda3/lib/python3.6/site-packages/numpy/lib/function_base.py:3163: RuntimeWarning: invalid value encountered in true_divide c /= stddev[None, :]
| column_name | value |
|---|---|
| assessmentyear | 1 |
| storytypeid | 1 |
| pooltypeid2 | 1 |
| pooltypeid7 | 1 |
| pooltypeid10 | 1 |
| poolcnt | 1 |
| decktypeid | 1 |
| buildingclasstypeid | 1 |
| | col_labels | corr_values |
|---|---|---|
| 49 | taxamount | -0.014768 |
| 21 | heatingorsystemtypeid | -0.013732 |
| 43 | yearbuilt | 0.021171 |
| 4 | bedroomcnt | 0.032035 |
| 18 | fullbathcnt | 0.034267 |
| 7 | calculatedbathnbr | 0.036019 |
| 3 | bathroomcnt | 0.036862 |
| 10 | calculatedfinishedsquarefeet | 0.047659 |
| 11 | finishedsquarefeet12 | 0.048611 |
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/IPython/core/formatters.py in __call__(self, obj)
700 type_pprinters=self.type_printers,
701 deferred_pprinters=self.deferred_printers)
--> 702 printer.pretty(obj)
703 printer.flush()
704 return stream.getvalue()

~/anaconda3/lib/python3.6/site-packages/IPython/lib/pretty.py in pretty(self, obj)
393 if callable(meth):
394 return meth(obj, self, cycle)
--> 395 return _default_pprint(obj, self, cycle)
396 finally:
397 self.end_group()

~/anaconda3/lib/python3.6/site-packages/IPython/lib/pretty.py in _default_pprint(obj, p, cycle)
508 if _safe_getattr(klass, '__repr__', None) is not object.__repr__:
509 # A user-provided repr. Find newlines and replace them with p.break_()
--> 510 _repr_pprint(obj, p, cycle)
511 return
512 p.begin_group(1, '<')

~/anaconda3/lib/python3.6/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
699 """A pprint that just redirects to the normal repr function."""
700 # Find newlines and replace them with p.break_()
--> 701 output = repr(obj)
702 for idx,output_line in enumerate(output.splitlines()):
703 if idx:

~/anaconda3/lib/python3.6/site-packages/ggplot/ggplot.py in __repr__(self)
114
115 def __repr__(self):
--> 116 self.make()
117 # this is nice for dev but not the best for "real"
118 if os.environ.get("GGPLOT_DEV"):

~/anaconda3/lib/python3.6/site-packages/ggplot/ggplot.py in make(self)
634 if kwargs==False:
635 continue
--> 636 layer.plot(ax, facetgroup, self._aes, **kwargs)
637
638 self.apply_limits()

~/anaconda3/lib/python3.6/site-packages/ggplot/stats/stat_smooth.py in plot(self, ax, data, _aes)
75
76 smoothed_data = pd.DataFrame(dict(x=x, y=y, y1=y1, y2=y2))
---> 77 smoothed_data = smoothed_data.sort('x')
78
79 params = self._get_plot_args(data, _aes)

~/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py in __getattr__(self, name)
3079 if name in self._info_axis:
3080 return self[name]
-> 3081 return object.__getattribute__(self, name)
3082
3083 def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'sort'
Training Size:(90275, 61) Property Size:(2985217, 58)
| column_name | missing_count |
|---|---|
| parcelid | 0 |
| airconditioningtypeid | 2173698 |
| architecturalstyletypeid | 2979156 |
| basementsqft | 2983589 |
| bathroomcnt | 11462 |
| bedroomcnt | 11450 |
| buildingclasstypeid | 2972588 |
| buildingqualitytypeid | 1046729 |
| calculatedbathnbr | 128912 |
| decktypeid | 2968121 |
| finishedfloor1squarefeet | 2782500 |
| calculatedfinishedsquarefeet | 55565 |
| finishedsquarefeet12 | 276033 |
| finishedsquarefeet13 | 2977545 |
| finishedsquarefeet15 | 2794419 |
| finishedsquarefeet50 | 2782500 |
| finishedsquarefeet6 | 2963216 |
| fips | 11437 |
| fireplacecnt | 2672580 |
| fullbathcnt | 128912 |
| garagecarcnt | 2101950 |
| garagetotalsqft | 2101950 |
| hashottuborspa | 2916203 |
| heatingorsystemtypeid | 1178816 |
| latitude | 11437 |
| longitude | 11437 |
| lotsizesquarefeet | 276099 |
| poolcnt | 2467683 |
| poolsizesum | 2957257 |
| pooltypeid10 | 2948278 |
| pooltypeid2 | 2953142 |
| pooltypeid7 | 2499758 |
| propertycountylandusecode | 12277 |
| propertylandusetypeid | 11437 |
| propertyzoningdesc | 1006588 |
| rawcensustractandblock | 11437 |
| regionidcity | 62845 |
| regionidcounty | 11437 |
| regionidneighborhood | 1828815 |
| regionidzip | 13980 |
| roomcnt | 11475 |
| storytypeid | 2983593 |
| threequarterbathnbr | 2673586 |
| typeconstructiontypeid | 2978470 |
| unitcnt | 1007727 |
| yardbuildingsqft17 | 2904862 |
| yardbuildingsqft26 | 2982570 |
| yearbuilt | 59928 |
| numberofstories | 2303148 |
| fireplaceflag | 2980054 |
| structuretaxvaluedollarcnt | 54982 |
| taxvaluedollarcnt | 42550 |
| assessmentyear | 11439 |
| landtaxvaluedollarcnt | 67733 |
| taxamount | 31250 |
| taxdelinquencyflag | 2928755 |
| taxdelinquencyyear | 2928753 |
| censustractandblock | 75126 |
Several columns have a very high proportion of missing values; basementsqft and storytypeid, for example, are missing in more than 99.9% of the 2,985,217 property rows. It may be worth analysing these columns more closely.
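A missing-value summary of this kind can be produced with a few pandas calls. The sketch below uses a small made-up frame (the column names are borrowed from the dataset, the values are not), and the 0.5 threshold is a hypothetical choice for illustration rather than one taken from this notebook.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; column names are borrowed from the dataset, values are made up.
df = pd.DataFrame({
    "parcelid": [1, 2, 3, 4],
    "basementsqft": [np.nan, np.nan, np.nan, 500.0],
    "bathroomcnt": [2.0, 1.0, np.nan, 3.0],
})

# Count nulls per column, then express them as a share of all rows.
missing = df.isnull().sum().reset_index()
missing.columns = ["column_name", "missing_count"]
missing["missing_ratio"] = missing["missing_count"] / len(df)

# Hypothetical threshold: flag columns that are more than half empty.
mostly_missing = missing.loc[missing["missing_ratio"] > 0.5]
print(mostly_missing)
```

Columns flagged this way are candidates for dropping, or for replacing with a simple indicator of whether a value is present.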
| | parcelid | logerror | transactiondate | transaction_month |
|---|---|---|---|---|
| 0 | 11016594 | 0.0276 | 2016-01-01 | 1 |
| 4392 | 12379107 | 0.0276 | 2016-01-22 | 1 |
| 4391 | 12259947 | 0.0010 | 2016-01-22 | 1 |
| 4390 | 17204079 | 0.0871 | 2016-01-22 | 1 |
| 4389 | 12492292 | -0.0212 | 2016-01-22 | 1 |
For submission we are required to predict values for October, November and December. The differing distributions of the target variable over these months indicate that it may be useful to create an additional 'transaction_month' feature, as shown above. Let's take a closer look at the distribution across only October, November and December.
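A 'transaction_month' feature like the one described can be derived from the date column with the pandas datetime accessor. This is a minimal sketch on a few made-up rows, not the notebook's original code; the final filter keeps only the three months that have to be predicted.

```python
import pandas as pd

# Made-up rows for illustration; the real frame is the merged training data.
train = pd.DataFrame({
    "parcelid": [1, 2, 3, 4],
    "logerror": [0.02, -0.01, 0.05, -0.03],
    "transactiondate": ["2016-01-22", "2016-10-03", "2016-11-15", "2016-12-30"],
})

# Parse the dates, then extract the calendar month as an integer from 1 to 12.
train["transactiondate"] = pd.to_datetime(train["transactiondate"])
train["transaction_month"] = train["transactiondate"].dt.month

# Restrict to October, November and December for a closer look.
last_quarter = train.loc[train["transaction_month"].isin([10, 11, 12])]
print(last_quarter)
```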
Proportion of Transactions in Each Month
| month | transaction_month (proportion) |
|---|---|
| 1 | 0.072623 |
| 2 | 0.070152 |
| 3 | 0.095840 |
| 4 | 0.103140 |
| 5 | 0.110341 |
| 6 | 0.120986 |
| 7 | 0.110186 |
| 8 | 0.116045 |
| 9 | 0.106065 |
| 10 | 0.055132 |
| 11 | 0.020227 |
| 12 | 0.019263 |
This dataset contains more transactions occurring in the spring and summer months, although it must be noted that some transactions from October, November and December have been removed to form the competition's test set (thanks to nonrandom for pointing this out).
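Monthly proportions like those reported here can be computed with `value_counts(normalize=True)`. The sketch below uses a short made-up series of month numbers, so its output will not match the figures in this report.

```python
import pandas as pd

# Made-up month numbers standing in for train["transaction_month"].
months = pd.Series([1, 1, 3, 6, 6, 6, 10, 12], name="transaction_month")

# normalize=True returns shares instead of raw counts; sort by month for readability.
proportions = months.value_counts(normalize=True).sort_index()

# Share of transactions that fall in the months to be predicted.
late = proportions.loc[proportions.index.isin([10, 11, 12])]
print(proportions)
print("October to December share:", late.sum())
```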
Feature Importance
| | parcelid | logerror | transactiondate | transaction_month | airconditioningtypeid | architecturalstyletypeid | basementsqft | bathroomcnt | bedroomcnt | buildingclasstypeid | ... | numberofstories | fireplaceflag | structuretaxvaluedollarcnt | taxvaluedollarcnt | assessmentyear | landtaxvaluedollarcnt | taxamount | taxdelinquencyflag | taxdelinquencyyear | censustractandblock |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11016594 | 0.0276 | 2016-01-01 | 1 | 1.0 | -1.0 | -1.0 | 2.0 | 3.0 | -1.0 | ... | -1.0 | -1 | 122754.0 | 360170.0 | 2015.0 | 237416.0 | 6735.88 | -1 | -1.0 | 6.037107e+13 |
| 1 | 12379107 | 0.0276 | 2016-01-22 | 1 | -1.0 | -1.0 | -1.0 | 1.0 | 2.0 | -1.0 | ... | -1.0 | -1 | 37095.0 | 185481.0 | 2015.0 | 148386.0 | 3051.73 | -1 | -1.0 | 6.037532e+13 |
| 2 | 12259947 | 0.0010 | 2016-01-22 | 1 | -1.0 | -1.0 | -1.0 | 1.0 | 3.0 | -1.0 | ... | -1.0 | -1 | 137012.0 | 240371.0 | 2015.0 | 103359.0 | 5707.91 | -1 | -1.0 | 6.037541e+13 |
| 3 | 17204079 | 0.0871 | 2016-01-22 | 1 | -1.0 | -1.0 | -1.0 | 4.0 | 4.0 | -1.0 | ... | 2.0 | -1 | 373100.0 | 746200.0 | 2015.0 | 373100.0 | 8576.10 | -1 | -1.0 | 6.111008e+13 |
| 4 | 12492292 | -0.0212 | 2016-01-22 | 1 | 1.0 | -1.0 | -1.0 | 1.0 | 3.0 | -1.0 | ... | -1.0 | -1 | 40729.0 | 61709.0 | 2015.0 | 20980.0 | 1056.92 | -1 | -1.0 | 6.037571e+13 |

5 rows × 61 columns

(90275, 61)
| | transaction_month | airconditioningtypeid | architecturalstyletypeid | basementsqft | bathroomcnt | bedroomcnt | buildingclasstypeid | buildingqualitytypeid | calculatedbathnbr | decktypeid | ... | numberofstories | fireplaceflag | structuretaxvaluedollarcnt | taxvaluedollarcnt | assessmentyear | landtaxvaluedollarcnt | taxamount | taxdelinquencyflag | taxdelinquencyyear | censustractandblock |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1.0 | -1.0 | -1.0 | 2.0 | 3.0 | -1.0 | 4.0 | 2.0 | -1.0 | ... | -1.0 | 0 | 122754.0 | 360170.0 | 2015.0 | 237416.0 | 6735.88 | 0 | -1.0 | 6.037107e+13 |
| 1 | 1 | -1.0 | -1.0 | -1.0 | 1.0 | 2.0 | -1.0 | 7.0 | 1.0 | -1.0 | ... | -1.0 | 0 | 37095.0 | 185481.0 | 2015.0 | 148386.0 | 3051.73 | 0 | -1.0 | 6.037532e+13 |
| 2 | 1 | -1.0 | -1.0 | -1.0 | 1.0 | 3.0 | -1.0 | 7.0 | 1.0 | -1.0 | ... | -1.0 | 0 | 137012.0 | 240371.0 | 2015.0 | 103359.0 | 5707.91 | 0 | -1.0 | 6.037541e+13 |
| 3 | 1 | -1.0 | -1.0 | -1.0 | 4.0 | 4.0 | -1.0 | -1.0 | 4.0 | -1.0 | ... | 2.0 | 0 | 373100.0 | 746200.0 | 2015.0 | 373100.0 | 8576.10 | 0 | -1.0 | 6.111008e+13 |
| 4 | 1 | 1.0 | -1.0 | -1.0 | 1.0 | 3.0 | -1.0 | 7.0 | 1.0 | -1.0 | ... | -1.0 | 0 | 40729.0 | 61709.0 | 2015.0 | 20980.0 | 1056.92 | 0 | -1.0 | 6.037571e+13 |

5 rows × 58 columns

| | logerror |
|---|---|
| 0 | 0.0276 |
| 1 | 0.0276 |
| 2 | 0.0010 |
| 3 | 0.0871 |
| 4 | -0.0212 |
| | features | importance |
|---|---|---|
| 0 | transaction_month | 0.039308 |
| 1 | airconditioningtypeid | 0.006998 |
| 2 | architecturalstyletypeid | 0.000359 |
| 3 | basementsqft | 0.000310 |
| 4 | bathroomcnt | 0.007828 |

| | features | importance |
|---|---|---|
| 50 | structuretaxvaluedollarcnt | 0.083723 |
| 25 | longitude | 0.077608 |
| 54 | taxamount | 0.075427 |
| 24 | latitude | 0.074305 |
| 26 | lotsizesquarefeet | 0.071182 |
Here we see that the greatest importance in predicting the log-error comes from features involving taxes and geographical location of the property. Notably, the 'transaction_month' feature that was engineered earlier was the 12th most important feature.
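The report does not show which model produced these importances, so the following is only a sketch of one common approach: fitting a scikit-learn `RandomForestRegressor` and reading its `feature_importances_` attribute. The data is synthetic, the three column names are borrowed from the dataset purely as labels, and the choice of a random forest is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic features; names are borrowed from the dataset only as labels.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "taxamount": rng.normal(size=500),
    "latitude": rng.normal(size=500),
    "transaction_month": rng.integers(1, 13, size=500),
})
# Synthetic target driven almost entirely by the first column.
y = 3.0 * X["taxamount"] + 0.1 * rng.normal(size=500)

# Assumed model choice: any tree ensemble exposing feature_importances_ would do.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)

# Tabulate importances and rank them from most to least important.
importance = (
    pd.DataFrame({"features": X.columns, "importance": model.feature_importances_})
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)
print(importance)
```

Impurity-based importances of this kind sum to 1 across features, so they are best read as relative rankings rather than absolute effect sizes.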
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-20-1e807e847848> in <module>()
----> 1 test= test.rename(columns={'ParcelId': 'parcelid'})
2 #To make it easier for merging datasets on same column_id later

NameError: name 'test' is not defined